Federated Learning Framework for Scalable AI in Heterogeneous HPC and Cloud Environments
Ghimire, Sangam, Timalsina, Paribartan, Bhurtel, Nirjal, Neupane, Bishal, Shrestha, Bigyan Byanju, Bhattarai, Subarna, Gaire, Prajwal, Thapa, Jessica, Jha, Sudan
As AI models continue to grow in complexity and size, so does the demand for vast computational resources and access to large-scale distributed datasets. At the same time, growing concerns about data privacy, ownership, and regulatory compliance make it increasingly difficult to centralize data for training. FL has emerged as a promising paradigm for addressing these challenges, enabling collaborative model training across multiple data silos without requiring the raw data to leave its source. While FL has gained traction in mobile and edge environments, such as smartphones and IoT devices, its application in large-scale computing platforms like HPC clusters and cloud infrastructure remains underexplored. Meanwhile, the convergence of HPC and cloud computing is reshaping the landscape of modern data-intensive applications. These hybrid environments combine the raw power and efficiency of HPC with the scalability and flexibility of the cloud, making them well-suited for training large AI models. However, this integration brings new challenges: heterogeneous hardware (e.g., Central Processing Units (CPUs), Graphics Processing Units (GPUs), Tensor Processing Units (TPUs)), inconsistent network performance, dynamic resource availability, and non-uniform data distributions across clients. In this context, the deployment of federated learning across such mixed infrastructure is both a timely opportunity and a technical challenge. This paper explores how FL can be adapted and optimized to run efficiently across heterogeneous HPC and cloud environments, with a focus on scalability, system resilience, and performance under non-IID data conditions.
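The aggregation step at the heart of the federated setup described above can be illustrated with a minimal FedAvg sketch. This is not the paper's implementation; the layer name and client sizes are illustrative, and the weighting by local sample count is the standard way FedAvg handles the non-uniform data distributions the abstract mentions.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg).

    client_weights: list of dicts mapping layer name -> np.ndarray
    client_sizes:   number of local training samples per client
    """
    total = sum(client_sizes)
    avg = {}
    for name in client_weights[0]:
        avg[name] = sum(
            (n / total) * w[name] for w, n in zip(client_weights, client_sizes)
        )
    return avg

# Two simulated clients with unequal data volumes (a toy non-IID setting):
c1 = {"dense": np.array([1.0, 1.0])}
c2 = {"dense": np.array([3.0, 3.0])}
merged = fedavg([c1, c2], client_sizes=[1, 3])
print(merged["dense"])  # weights skew toward the larger client: [2.5, 2.5]
```

Only model parameters cross the network; the raw local data never leaves its silo, which is the privacy property the paragraph above relies on.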
- South America > Brazil > Rio Grande do Sul > Porto Alegre (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
- (2 more...)
- Information Technology > Services (1.00)
- Information Technology > Security & Privacy (1.00)
A Hybrid Proactive And Predictive Framework For Edge Cloud Resource Management
Kumar, Hrikshesh, Garg, Anika, Gupta, Anshul, Agarwal, Yashika
Traditional edge-cloud workload resource management is too reactive. The problem with relying on static thresholds is that we either overspend on more resources than needed or suffer reduced performance for lack of them. This is why we work on proactive solutions: a framework that stops reacting to problems and starts anticipating them. We design a hybrid architecture combining two powerful tools: a CNN-LSTM model for time-series forecasting and an orchestrator based on multi-agent Deep Reinforcement Learning (DRL). The novelty lies in how we combine them: we embed the predictive forecast from the CNN-LSTM directly into the DRL agent's state space. That is what makes the AI manager smarter: it sees the future, which allows it to make better long-term decisions about where to run tasks, finding the sweet spot between saving money and keeping the system healthy and applications fast for users. In effect, we have given it eyes to see down the road, so that it does not lurch from one problem to another but finds a smooth path forward. Our tests show that our system clearly beats the older methods. It excels at tough problems such as making complex decisions and juggling multiple goals at once: being cheap, fast, and reliable.
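The key architectural idea, embedding the forecast in the agent's state, can be sketched simply. A minimal sketch, assuming a CNN-LSTM that emits a short horizon of predicted metrics; the metric names, horizon length, and shapes here are hypothetical, not taken from the paper.

```python
import numpy as np

def build_agent_state(current_metrics, forecast):
    """Concatenate live system metrics with the CNN-LSTM forecast so the
    DRL agent 'sees the future' when choosing a placement action.

    current_metrics: shape (n_metrics,),  e.g. [cpu_util, mem_util, latency]
    forecast:        shape (horizon, n_metrics), predicted future metrics
    """
    return np.concatenate([current_metrics, forecast.ravel()])

now = np.array([0.62, 0.48, 0.15])            # observed utilisation
pred = np.array([[0.70, 0.50, 0.16],          # hypothetical CNN-LSTM output
                 [0.81, 0.55, 0.19]])         # for the next two intervals
state = build_agent_state(now, pred)
print(state.shape)  # (9,): the vector fed to the DRL policy network
```

Because the predicted utilisation rise is already in the observation, a trained policy can migrate or scale workloads before the threshold would have fired, rather than after.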
- Information Technology > Security & Privacy (1.00)
- Energy (0.94)
- Transportation (0.68)
- Telecommunications (0.67)
Machine learning-based cloud resource allocation algorithms: a comprehensive comparative review
Cloud resource allocation has emerged as a major challenge in modern computing environments, with organizations struggling to manage complex, dynamic workloads while optimizing performance and cost efficiency. Traditional heuristic approaches prove inadequate for handling the multi-objective optimization demands of existing cloud infrastructures. This paper presents a comparative analysis of state-of-the-art artificial intelligence and machine learning algorithms for resource allocation. We systematically evaluate 10 algorithms across four categories: Deep Reinforcement Learning approaches, Neural Network architectures, Traditional Machine Learning enhanced methods, and Multi-Agent systems. Analysis of published results demonstrates significant performance improvements across multiple metrics including makespan reduction, cost optimization, and energy efficiency gains compared to traditional methods. The findings reveal that hybrid architectures combining multiple artificial intelligence and machine learning techniques consistently outperform single-method approaches, with edge computing environments showing the highest deployment readiness. Our analysis provides critical insights for both academic researchers and industry practitioners seeking to implement next-generation cloud resource allocation strategies in increasingly complex and dynamic computing environments.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
- (9 more...)
- Overview (1.00)
- Research Report (0.84)
- Information Technology > Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Law (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Artificial Intelligence for Cost-Aware Resource Prediction in Big Data Pipelines
Efficient resource allocation is a key challenge in modern cloud computing. Over-provisioning leads to unnecessary costs, while under-provisioning risks performance degradation and SLA violations. This work presents an artificial intelligence approach to predicting resource utilization in big data pipelines using Random Forest regression. We preprocess the Google Borg cluster traces to clean, transform, and extract relevant features (CPU, memory, usage distributions). The model achieves high predictive accuracy (R² = 0.99, MAE = 0.0048, RMSE = 0.137), capturing non-linear relationships between workload characteristics and resource utilization. Error analysis reveals strong performance on small-to-medium jobs, with higher variance in rare large-scale jobs. These results demonstrate the potential of AI-driven prediction for cost-aware autoscaling in cloud environments, reducing unnecessary provisioning while safeguarding service quality.
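The regression setup described above can be sketched with scikit-learn on synthetic data. This is a stand-in, not the paper's pipeline: the features, target function, and split are invented to mimic the non-linear workload-to-utilization relationship the abstract describes, and the resulting error is not comparable to the reported metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Synthetic stand-in for Borg-style features: requested CPU, requested
# memory, priority; target = actual CPU usage with a non-linear interaction.
X = rng.uniform(0, 1, size=(500, 3))
y = 0.6 * X[:, 0] + 0.3 * X[:, 0] * X[:, 1] + 0.05 * X[:, 2]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:400], y[:400])                      # simple time-ordered split
mae = mean_absolute_error(y[400:], model.predict(X[400:]))
print(f"hold-out MAE: {mae:.4f}")
```

The same fit-then-predict loop, fed live workload features, is what would drive a cost-aware autoscaler: scale to the predicted utilization plus a safety margin instead of a fixed over-provisioned quota.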
Towards Generalizable Context-aware Anomaly Detection: A Large-scale Benchmark in Cloud Environments
Zou, Xinkai, Jiang, Xuan, Huang, Ruikai, He, Haoze, Kapoor, Parv, Wu, Hongrui, Wang, Yibo, Sha, Jian, Shi, Xiongbo, Huang, Zixun, Zhao, Jinhua
Anomaly detection in cloud environments remains both critical and challenging. Existing context-level benchmarks typically focus on either metrics or logs and often lack reliable annotation, while most detection methods emphasize point anomalies within a single modality, overlooking contextual signals and limiting real-world applicability. Constructing a benchmark for context anomalies that combines metrics and logs is inherently difficult: reproducing anomalous scenarios on real servers is often infeasible or potentially harmful, while generating synthetic data introduces the additional challenge of maintaining cross-modal consistency. Ensuring the stability and availability of large-scale cloud systems is of great importance (Kazemzadeh & Jacobsen, 2009; Bu et al., 2018; Zhang et al., 2015). Accurate detection methods that can also discriminate among anomaly scenarios are essential to mitigate potential losses (Zhang et al., 2018; Barbhuiya et al., 2018a). Large-scale cloud systems usually generate abundant logs and expose various metrics, both of which serve as some of the most valuable data sources for anomaly detection (Lin et al., 2016; Nandi et al., 2016). Numerous benchmarks have been proposed for cloud anomaly detection, such as (Oliner & Stearley, 2007; Xu et al., 2009; Akmeemana et al., 2025). However, most existing research and benchmarks for cloud anomaly detection have focused on point anomalies, where deviations are identified in isolation within a single modality, such as metrics or logs. Although these benchmarks have provided the community with relevant evaluation testbeds, they capture only a narrow slice of the anomaly landscape and often fail to reflect the complexity of real cloud environments.
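The distinction the abstract draws between point and context anomalies can be made concrete with a small sketch: a value that is globally unremarkable becomes anomalous relative to its local context. This is an illustrative detector, not a method from the paper; the window size and scoring rule are arbitrary choices.

```python
import numpy as np

def context_anomaly_scores(series, window=10):
    """Score each point by its deviation from a local (contextual) baseline
    rather than the global distribution: |x_t - local_mean| / local_std."""
    scores = np.zeros(len(series))
    for t in range(window, len(series)):
        ctx = series[t - window:t]
        std = ctx.std() or 1e-9          # guard against zero-variance windows
        scores[t] = abs(series[t] - ctx.mean()) / std
    return scores

# A value of 5.0 is globally normal but anomalous inside the quiet regime.
metric = np.concatenate([np.full(20, 5.0), np.full(20, 1.0)])
metric[30] = 5.0  # contextual anomaly: same value, wrong context
scores = context_anomaly_scores(metric)
print(scores[30] > scores[15])  # True: flagged only where context disagrees
```

A pure point-anomaly detector thresholding on the global distribution would miss `metric[30]` entirely, which is exactly the gap in existing benchmarks the paragraph above identifies.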
- Asia > Middle East > Jordan (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- (2 more...)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services (0.93)
- Energy (0.68)
CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload
Shahbazinia, Amirhossein, Huang, Darong, Costero, Luis, Atienza, David
Cloud platforms are increasingly relied upon to host diverse, resource-intensive workloads due to their scalability, flexibility, and cost-efficiency. In multi-tenant cloud environments, virtual machines are consolidated on shared physical servers to improve resource utilization. While virtualization guarantees resource partitioning for CPU, memory, and storage, it cannot ensure performance isolation. Competition for shared resources such as last-level cache, memory bandwidth, and network interfaces often leads to severe performance degradation. Existing management techniques, including VM scheduling and resource provisioning, require accurate performance prediction to mitigate interference. However, this remains challenging in public clouds due to the black-box nature of VMs and the highly dynamic nature of workloads. To address these limitations, we propose CloudFormer, a dual-branch Transformer-based model designed to predict VM performance degradation in black-box environments. CloudFormer jointly models temporal dynamics and system-level interactions, leveraging 206 system metrics at one-second resolution across both static and dynamic scenarios. This design enables the model to capture transient interference effects and adapt to varying workload conditions without scenario-specific tuning. Complementing the methodology, we provide a fine-grained dataset that significantly expands the temporal resolution and metric diversity compared to existing benchmarks. Experimental results demonstrate that CloudFormer consistently outperforms state-of-the-art baselines across multiple evaluation metrics, achieving robust generalization across diverse and previously unseen workloads. Notably, CloudFormer attains a mean absolute error (MAE) of just 7.8%, representing a substantial improvement in predictive accuracy and outperforming existing methods by at least 28%.
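The building block a Transformer branch applies to such telemetry is scaled dot-product attention, which relates every time step of the metric window to every other. The sketch below shows only that generic core in NumPy, with invented shapes; CloudFormer's actual dual-branch architecture, projections, and training are not reproduced here.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax over the time axis:
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
T, d = 8, 16               # 8 one-second steps, 16-dim projected metrics
x = rng.normal(size=(T, d))
out = attention(x, x, x)   # self-attention over the telemetry window
print(out.shape)  # (8, 16)
```

Because every output step is a weighted mix of all input steps, attention can pick up the transient, cross-step interference effects that fixed-window models tend to smear out.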
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Spain > Galicia > Madrid (0.04)
- North America > United States > California > Santa Clara County > Mountain View (0.04)
- (2 more...)
SLA-Centric Automated Algorithm Selection Framework for Cloud Environments
Rizwan, Siana, Ahmed, Tasnim, Choudhury, Salimur
Cloud computing offers on-demand resource access, regulated by Service-Level Agreements (SLAs) between consumers and Cloud Service Providers (CSPs). SLA violations can impact efficiency and CSP profitability. In this work, we propose an SLA-aware automated algorithm-selection framework for combinatorial optimization problems in resource-constrained cloud environments. The framework uses an ensemble of machine learning models to predict performance and rank algorithm-hardware pairs based on SLA constraints. We also apply our framework to the 0-1 knapsack problem. We curate a dataset comprising instance-specific features along with memory usage, runtime, and optimality gap for 6 algorithms. As an empirical benchmark, we evaluate the framework on both classification and regression tasks. Our ablation study explores the impact of hyperparameters, learning approaches, and the effectiveness of large language models in regression, as well as SHAP-based interpretability.
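For readers unfamiliar with the benchmark problem, the 0-1 knapsack the framework is applied to admits a compact exact solution by dynamic programming; this is one of the candidate algorithms such a selector might rank, not code from the paper.

```python
def knapsack_01(values, weights, capacity):
    """Classic 0-1 knapsack DP: dp[c] = best value achievable with capacity c.
    Iterating items outermost and capacity downward enforces use-at-most-once."""
    dp = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

best = knapsack_01(values=[60, 100, 120], weights=[10, 20, 30], capacity=50)
print(best)  # 220: items of weight 20 and 30
```

The DP runs in O(n x capacity) time and memory, while greedy heuristics run faster but leave an optimality gap; exactly this runtime/memory/gap trade-off across algorithms is what the proposed selector learns to predict against SLA constraints.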
Cross-Cloud Data Privacy Protection: Optimizing Collaborative Mechanisms of AI Systems by Integrating Federated Learning and LLMs
In the age of cloud computing, data privacy protection has become a major challenge, especially when sharing sensitive data across cloud environments. However, how to optimize collaboration across cloud environments remains an unresolved problem. In this paper, we combine federated learning with large language models to optimize the collaborative mechanism of AI systems. Building on the existing federated learning framework, we introduce a cross-cloud architecture in which federated learning aggregates model updates from decentralized nodes without exposing the original data. At the same time, the powerful contextual and semantic understanding capabilities of large language models are used to improve model training efficiency and decision-making ability. We further innovate by introducing a secure communication layer to ensure the privacy and integrity of model updates and training data. The model enables continuous adaptation and fine-tuning across different cloud environments while protecting sensitive data. Experimental results show that the proposed method significantly outperforms the traditional federated learning model in terms of accuracy, convergence speed, and data privacy protection.
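One standard way to give cross-cloud model updates the integrity guarantee a secure communication layer provides is to authenticate each update with an HMAC. A minimal sketch with Python's standard library; the shared key and update shape are placeholders, and the paper's actual layer (which would also need encryption and key management) is not specified here.

```python
import hashlib
import hmac
import numpy as np

SHARED_KEY = b"hypothetical-per-node-key"  # placeholder, not from the paper

def sign_update(update: np.ndarray) -> bytes:
    """Attach an HMAC-SHA256 tag so the aggregator can verify the update's
    integrity without ever seeing the node's raw training data."""
    return hmac.new(SHARED_KEY, update.tobytes(), hashlib.sha256).digest()

def verify_update(update: np.ndarray, tag: bytes) -> bool:
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign_update(update), tag)

delta = np.array([0.01, -0.02, 0.005])   # a node's model-weight delta
tag = sign_update(delta)
print(verify_update(delta, tag))         # True: untampered update accepted
print(verify_update(delta + 1e-6, tag))  # False: any tampering is detected
```

Only the weight delta and its tag cross cloud boundaries; the training data itself stays on the originating node, matching the federated premise above.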
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > United States > Illinois > Champaign County > Champaign (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
Adaptive Security Policy Management in Cloud Environments Using Reinforcement Learning
Saqib, Muhammad, Mehta, Dipkumar, Yashu, Fnu, Malhotra, Shubham
The security of cloud environments, such as Amazon Web Services (AWS), is complex and dynamic. Static security policies have become inadequate as threats evolve and cloud resources exhibit elasticity [1]. This paper addresses the limitations of static policies by proposing a security policy management framework that uses reinforcement learning (RL) to adapt dynamically. Specifically, we employ deep reinforcement learning algorithms, including Deep Q-Networks and Proximal Policy Optimization, enabling the learning and continuous adjustment of controls such as firewall rules and Identity and Access Management (IAM) policies. The proposed RL-based solution leverages cloud telemetry data (AWS CloudTrail logs, network traffic data, threat intelligence feeds) to continuously refine security policies, maximizing threat mitigation and compliance while minimizing resource impact. Experimental results demonstrate that our adaptive RL-based framework significantly outperforms static policies, achieving higher intrusion detection rates (92% compared to 82% for static policies) and substantially reducing incident detection and response times by 58%. In addition, it maintains high conformity with security requirements and efficient resource usage.

I. INTRODUCTION

Cloud security is a critical concern as more organizations rely on cloud infrastructure. AWS and other cloud platforms provide security configurations such as firewall rules and IAM policies, which are typically managed through static policies set by administrators. However, static policies cannot adapt to the dynamic nature of cloud environments, where workloads, users, and attack patterns change rapidly [1]. This rigidity exposes cloud deployments to new threats or misconfigurations that are not covered by static rules. For instance, static firewall rules may fail to detect novel attack patterns, and fixed IAM roles may become over-privileged as resources scale, increasing risk.

Problem Statement: Traditional cloud security policy management cannot keep pace with evolving threats and agile DevOps practices. Manual policy updates are error-prone and slow.
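The adapt-the-policy-from-feedback loop can be illustrated with tabular Q-learning, a simpler relative of the Deep Q-Networks the paper employs. Everything here is a toy: the two threat levels, two posture actions, and reward shaping are invented for illustration, not drawn from the paper's AWS telemetry setup.

```python
import random

# Toy MDP: states are threat levels, actions adjust the firewall posture.
STATES = ["low", "high"]
ACTIONS = ["relax", "tighten"]

def reward(state, action):
    # Hypothetical shaping: tightening pays off under threat; relaxing when
    # calm avoids the resource/usability cost of overly strict rules.
    if state == "high":
        return 1.0 if action == "tighten" else -1.0
    return 0.5 if action == "relax" else -0.2

random.seed(0)
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.2        # learning rate, discount, exploration
state = "low"
for _ in range(2000):
    # Epsilon-greedy action selection:
    a = random.choice(ACTIONS) if random.random() < eps else max(
        ACTIONS, key=lambda x: Q[(state, x)])
    r = reward(state, a)
    nxt = random.choice(STATES)          # threat level drifts independently here
    # Q-learning update toward reward plus discounted best next-state value:
    Q[(state, a)] += alpha * (r + gamma * max(Q[(nxt, b)] for b in ACTIONS)
                              - Q[(state, a)])
    state = nxt

print(max(ACTIONS, key=lambda a: Q[("high", a)]))  # learned: "tighten"
```

A deep variant replaces the Q-table with a neural network so the state can be a high-dimensional telemetry vector (CloudTrail events, traffic features) rather than a two-value label, which is what makes the approach viable at cloud scale.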
- North America > United States > New York > Suffolk County > Stony Brook (0.04)
- Europe > Latvia > Riga Municipality > Riga (0.04)
- Asia > Middle East > Bahrain > Capital Governorate > Manama (0.04)
- (8 more...)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
AI-Driven Security in Cloud Computing: Enhancing Threat Detection, Automated Response, and Cyber Resilience
Shaffi, Shamnad Mohamed, Vengathattil, Sunish, Sidhick, Jezeena Nikarthil, Vijayan, Resmi
Cloud security concerns have grown markedly in recent years as threats in the computing world have become more complex. Many traditional solutions do not work well in real time to detect or prevent these more sophisticated threats. Artificial intelligence is today regarded as a revolution in building a protection plan for cloud data architecture through machine learning, statistical visualization of computing infrastructure, and detection of security breaches followed by counteraction. These AI-enabled systems ease the work as more network activity is scrutinized, and any anomalous behavior that might be a precursor to a more serious breach is prevented. This paper examines ways AI can enhance cloud security by applying predictive analytics, behavior-based security threat detection, and AI-stirring encryption. It also outlines the problems of previous security models and how AI overcomes them. Relatedly, issues such as data privacy, biases in the AI model, and regulatory compliance are also covered. In sum, AI improves the protection of cloud computing environments; however, more effort is needed in subsequent phases to extend the technology's reliability, modularity, and ethical grounding. AI can also be blended with other emerging computing technologies, including blockchain, to improve security frameworks further. The paper discusses the current trends in securing cloud data architecture using AI and presents further research and application directions.
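The behavior-based detection idea above, flagging activity that deviates from an entity's own baseline, can be sketched with a simple per-account statistical profile. The account, metric, and threshold are hypothetical; real systems would use richer features and learned models rather than a fixed k-sigma rule.

```python
import statistics

def build_profile(history):
    """Per-entity behavioural baseline from past activity counts."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomalous(value, profile, k=3.0):
    """Flag behaviour more than k standard deviations from the baseline,
    the breach-precursor idea described above (illustrative threshold)."""
    mean, std = profile
    return abs(value - mean) > k * std

# Hypothetical per-hour API-call counts for one service account:
history = [98, 102, 100, 97, 103, 101, 99, 100]
profile = build_profile(history)
print(is_anomalous(104, profile))   # False: within normal variation
print(is_anomalous(450, profile))   # True: candidate breach precursor
```

Profiling each account against itself, rather than against a global norm, is what lets such systems catch quiet precursors (an account suddenly enumerating resources) that signature-based tools miss.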
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Washington > King County > Bellevue (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.70)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.66)